Executive Summary

Using data analysis techniques, I built a picture of how the profitability of short-term rentals varies across New York City at the zip-code level. With these results, clients can better understand the market, assess risks, and make decisions under different scenarios and time horizons. An interactive Shiny app is also provided for more intuitive exploration.

Key Insights and Conclusions

Recommendations

Package Loading

Required Packages

  • tidyverse (dplyr, ggplot2, ...) - data reading, manipulation and visualization
  • caret - pre-processing, feature selection
  • plotly - interactive visualization
  • ggmap - geographic visualization
  • kableExtra - styling for kable (interactive data tables within Markdown)
  • shiny - for building the Shiny app later

Data Loading

setwd("~/Desktop/apply for job!!!/company/Capital One/airbnb-zillow-data-challenge-master")
#setwd("working directory")
library(data.table) # provides fread()
airbnb <- fread('listings.csv', header = TRUE, na.strings = c(""))
# for Windows users, the downloaded file may be 'listings.csv.gz'
# airbnb <- fread('listings.csv.gz', header = TRUE, na.strings = c("")) # requires package R.utils
zillow <- fread('Zip_Zhvi_2bedroom.csv',header=T,na.strings=c(""))

Data Preparation & Data Quality Check

Cost and revenue datasets are handled separately to improve data quality for exploratory data analysis.

Revenue Data

Revenue data contains a mix of information, from property details (address, zip code, number of bedrooms and bathrooms) to host information and daily/weekly/monthly prices for a stay.

Check the dimensions of the revenue data, produce a summary, and check for NAs and unique values.

Remove Columns

  • First, non-relevant columns are removed.
  • Character columns in which more than 20% of observations are unique are removed, after accounting for missing values in those columns. 11 columns are removed.
  • Columns with near-zero variance are removed. 9 columns are removed, leaving 33 columns in the dataset.

Based on Relevance

Remove columns whose names start with "require" or "host", or end with "url" or "nights", as these columns are not relevant to the problem at hand. 28 columns are removed.

pattern <- colnames(
  airbnb %>% select(
    starts_with("require"),
    starts_with("host"),
    ends_with("url"),
    ends_with("nights")
  )
)
airbnb1 <- airbnb[, (pattern) := NULL]
pattern
##  [1] "requires_license"                 "require_guest_profile_picture"   
##  [3] "require_guest_phone_verification" "host_id"                         
##  [5] "host_url"                         "host_name"                       
##  [7] "host_since"                       "host_location"                   
##  [9] "host_about"                       "host_response_time"              
## [11] "host_response_rate"               "host_acceptance_rate"            
## [13] "host_is_superhost"                "host_thumbnail_url"              
## [15] "host_picture_url"                 "host_neighbourhood"              
## [17] "host_listings_count"              "host_total_listings_count"       
## [19] "host_verifications"               "host_has_profile_pic"            
## [21] "host_identity_verified"           "listing_url"                     
## [23] "thumbnail_url"                    "medium_url"                      
## [25] "picture_url"                      "xl_picture_url"                  
## [27] "minimum_nights"                   "maximum_nights"

Based on Unique Values - Character Columns

Character columns with near 100% unique values (every row is different) are removed, as they provide no group-level information that can be used on a larger population or scale. These columns include free-text fields describing the home, host, amenities, etc.

uniquedf <-
  airbnb1 %>% select_if(is.character) %>% summarise_all(funs(n_distinct(.))) 

uniquedf <-
  uniquedf %>% gather(key = var_name, value = value, 1:ncol(uniquedf))

uniquedf$percentUnique <- round(uniquedf$value/nrow(airbnb1),2)
uniqueval <- uniquedf %>% filter(percentUnique > 0.2) %>% pull(var_name)
airbnb2 <- airbnb1[, (uniqueval) := NULL]

uniqueval
##  [1] "name"                  "summary"              
##  [3] "space"                 "description"          
##  [5] "neighborhood_overview" "notes"                
##  [7] "transit"               "access"               
##  [9] "interaction"           "house_rules"          
## [11] "amenities"

Based on Variance

Get the metrics for variables that have near-zero variance using the caret package and remove those variables.

zvdf <- nearZeroVar(airbnb2, saveMetrics = TRUE)
ZVnames=rownames(subset(zvdf, nzv== TRUE))
airbnb3 <- airbnb2[ , (ZVnames) := NULL]

ZVnames
## [1] "scrape_id"           "experiences_offered" "state"              
## [4] "market"              "country_code"        "country"            
## [7] "bed_type"            "has_availability"    "license"

Remove 14 Columns Manually

Finally, 14 columns are removed manually after going through the dataset's data dictionary, based on relevance to the business problem.
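This manual step follows the same pattern as the earlier removals; here is a minimal base-R sketch (the column names below are illustrative placeholders, not the actual 14 columns removed):

```r
# Drop a hand-picked vector of columns by name (base-R equivalent of
# the data.table idiom dt[, (cols) := NULL] used elsewhere in this report).
df <- data.frame(id = 1:3, scrape_batch = "a", calendar_note = "b", price = 100)
manual_drop <- c("scrape_batch", "calendar_note")  # illustrative names only
df <- df[, !(names(df) %in% manual_drop), drop = FALSE]
```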

Guided by six data-quality dimensions, several cleaning steps are applied to the key columns.

  • Zip code
    • Replace the 567 missing zip codes with zip codes generated from latitude and longitude, so the dataset is complete.
    • Unify zip codes to 5 digits to maintain consistency and conformity across the dataset.
  • Number of bedrooms
    • Filter out properties whose number of bedrooms does not equal two, to keep integrity for the future merge with the cost dataset, which only has costs for two-bedroom properties.
  • Price
    • Remove the dollar and comma signs and convert the column to numeric format to maintain conformity.
    • For listings with room type "Private room", multiply the price by two and store the result in a new column, "price_new".
    • For visualization, cap outliers that lie outside the 1.5 × IQR limits: replace observations below the lower limit with the 5th-percentile value and those above the upper limit with the 95th-percentile value.
  • first review/last review
    • Convert to date format.
    • Converting the dates to years shows that listings in all four boroughs started their rental business between 2001 and 2017, most of them during 2014 to 2016.
    • More than 25% of the values in this variable are missing.
    • Given these two observations, I used the total number of reviews as a proxy for demand later.
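The outlier-capping step described above can be sketched as follows (a minimal illustration on a toy price vector; `cap_iqr` is a hypothetical helper, not part of the original pipeline):

```r
# Cap values outside the 1.5 * IQR fences at the 5th/95th percentiles.
cap_iqr <- function(x) {
  q <- quantile(x, c(0.05, 0.25, 0.75, 0.95), na.rm = TRUE)
  iqr   <- q["75%"] - q["25%"]
  lower <- q["25%"] - 1.5 * iqr
  upper <- q["75%"] + 1.5 * iqr
  x[x < lower] <- q["5%"]
  x[x > upper] <- q["95%"]
  x
}

# Example: the extreme nightly price is pulled down to the 95th percentile.
prices <- c(80, 95, 100, 110, 120, 125, 130, 2000)
capped <- cap_iqr(prices)
```

Note that because the percentiles are computed on data that still contains the outliers, extreme values are reduced but not eliminated, which is consistent with the later observation that the capped distribution still looks outlier-influenced.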

After data cleaning, the dataset has 149 unique zip codes.

Zip Code

  • Replace Missing Values. After replacing the missing zip codes, check whether any missing values remain in the zip code column.
############################################################################################# Zipcode
# 206 unique zipcodes, 567 missing
# replace missing with lat and long since there are no NA for lat and long
# 1. select NA values 
zipcode.na <- revenue[,c('id','zipcode','latitude','longitude')]
zipcode.na <- zipcode.na[is.na(zipcode.na$zipcode), ]
# 2. generate zip code based on lat and long 
result <- do.call(rbind,
                  lapply(1:nrow(zipcode.na),
                         function(i)revgeocode(as.numeric(zipcode.na[i,4:3])))) # takes almost 5 mins to run
# in case it cannot run, I also put the result in the folder
# save(result, file="result.Rdata")
# load("~(the working directory)/result.Rdata") #please run this line if the above code doesn't work properly.
setDT(zipcode.na)
zipcode.na[, result := result]
zipcode.na$zipcode_new <- substr(str_extract(zipcode.na$result," [0-9]{5}, .+"),2,6)
# check if there's still na
zipcode.na[is.na(zipcode.na$zipcode_new),]
##         id zipcode latitude longitude
## 1: 4896855    <NA> 40.83314 -73.91888
##                                               result zipcode_new
## 1: Grand Concourse/MC Clellan St, The Bronx, NY, USA        <NA>
# <NA> Grand Concourse/MC Clellan St, The Bronx, NY, USA -> 10452
zipcode.na[is.na(zipcode.na$zipcode_new),]$zipcode_new <- '10452'
# keep id and zipcode to merge with revenue table and replace the NAs
zipcode.na <- zipcode.na[,-c(2:5)]
revenue <- left_join(revenue, zipcode.na, by = "id")
setDT(revenue)
revenue[is.na(zipcode), zipcode := zipcode_new]
revenue[, zipcode_new := NULL]
# check if there's still NA in zip code
revenue[is.na(revenue$zipcode),]
## Empty data.table (0 rows) of 33 cols: id,last_scraped,street,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed...

There is no NA left in the zip code column.

  • Unify Zip Code to 5 Digits
# Unify zip code to 5 digits
# Valid Data Entries: A valid zipcode should have length = 5. Checking for number of zipcodes with valid length.
# for length >5, keep the first 5
revenue[nchar(zipcode) > 5, zipcode := substr(zipcode, 1, 5)]
# for <5, convert lat and long to zipcode
# 1. select <5 values 
zipcode5 <- revenue[,c('id','zipcode','latitude','longitude')]
zipcode5 <- zipcode5[nchar(zipcode5$zipcode) < 5, ]
# 2. generate zip code based on lat and long 
result2 <- do.call(rbind,
                  lapply(1:nrow(zipcode5),
                         function(i)revgeocode(as.numeric(zipcode5[i,4:3])))) # takes about one minute to run
setDT(zipcode5)
zipcode5[, result := result2]
zipcode5$zipcode_new <- substr(str_extract(zipcode5$result," [0-9]{5}, .+"),2,6)
zipcode5 <- zipcode5[,-c(2:5)]
revenue <- left_join(revenue, zipcode5, by = "id")
setDT(revenue)
revenue[nchar(revenue$zipcode) < 5, zipcode := zipcode_new]
revenue[, zipcode_new := NULL]
# check if the length all converts to 5 digits 
revenue[nchar(revenue$zipcode) != 5,]
## Empty data.table (0 rows) of 33 cols: id,last_scraped,street,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed...

Number of Rooms

Most properties in the data are one-bedroom homes/apartments or one-bedroom private rooms. Based on the assumption given in the problem statement, I selected two-bedroom properties. In the next step I converted the price of private rooms by multiplying it by two, but keep in mind that this includes prices for two-bedroom private rooms, which should be lower than prices for an entire home/apartment, so the price is somewhat underestimated here.

############################################################################################# Bedrooms 
revenue %>% 
  group_by(room_type,bedrooms) %>% 
  summarise(no_properties = n())
## # A tibble: 21 x 3
## # Groups:   room_type [?]
##    room_type       bedrooms no_properties
##    <chr>              <int>         <int>
##  1 Entire home/apt        0          3369
##  2 Entire home/apt        1         10104
##  3 Entire home/apt        2          4593
##  4 Entire home/apt        3          1349
##  5 Entire home/apt        4           340
##  6 Entire home/apt        5            81
##  7 Entire home/apt        6            27
##  8 Entire home/apt        7             8
##  9 Entire home/apt        8             7
## 10 Entire home/apt        9             2
## # ... with 11 more rows
revenue1 <- revenue[bedrooms==2,]
# 23896 obs 

Price

############################################################################################# Price
# remove the dollar and comma sign and change it into numeric for further calculation
revenue1$price <- gsub('\\$', '', revenue1$price)
revenue1$price_new <- as.numeric(gsub(",", "", revenue1$price))
revenue1[room_type == 'Private room', price_new := price_new*2]
############################################################################################# date format
revenue1$last_scraped <- as.Date(revenue1$last_scraped,format = '%Y-%m-%d')
revenue1$first_review <- as.Date(revenue1$first_review,format = '%Y-%m-%d')
revenue1$last_review <- as.Date(revenue1$last_review,format = '%Y-%m-%d')

Cost Data

First, subset the dataset by filtering for the city name "New York". The subset contains 25 unique zip codes from four boroughs: Brooklyn, Manhattan, Queens, and Staten Island.

  • County name
    • Convert county names to borough names for further analysis.
  • Quality check
    • Median prices are missing from 1996-04 to 2007-05.
    • There is no duplication in the cost dataset.

After data cleaning, there are 25 unique observations, each representing a zip code in the dataset.
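The "New York" filter described above can be sketched on toy data as follows (the real code would filter the Zillow table on its City column, which is an assumption about the column name):

```r
# Toy Zillow-style table: keep only rows where City == "New York".
zillow <- data.frame(
  City       = c("New York", "Los Angeles", "New York"),
  RegionName = c("10025", "90001", "11231"),
  stringsAsFactors = FALSE
)
zillow.ny <- zillow[zillow$City == "New York", ]  # 2 rows remain
```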

Zip Code

Based on the assumption that all properties and all square feet within each locale are homogeneous, I created a column indicating the borough each zip code belongs to.

# change the column 'CountyName' to 'neighbourhood_group_cleansed'
colnames(zillow.ny)[5] <- 'borough'
# convert the county name to borough name based on definition                
zillow.ny$borough <- ifelse(zillow.ny$borough %in% c('Queens', 'Bronx'), zillow.ny$borough,
                     ifelse(zillow.ny$borough == 'Kings', 'Brooklyn',
                     ifelse(zillow.ny$borough == 'New York', 'Manhattan', 'Staten Island')))

Quality Check

Historical Prices Change

# for rShiny
# keep price from 2007-2017
df1 <- zillow.ny[,c(1,5,141:261)]
# reshape wide format to long format
price <- df1 %>%
        gather('time', 'price',"2007-06":"2017-06")
price$time2 <- gsub("[: -]", "" , price$time, perl=TRUE)
# change format of time 
price$time <- as.Date(paste(price$time,"-01",sep=""))
# for rShiny app
write.csv(price,file = 'cost.csv')

Merge Two Datasets

# use the central moving average to represent the cost
zillow.ny1 <- zillow.ny[,c('RegionName','borough',
                           '2017-02','2017-03','2017-04','2017-05','2017-06')] %>%
  mutate(ma_5 = (`2017-02`+`2017-03`+`2017-04`+`2017-05`+`2017-06`)/5)
# merge two datasets (inner join)
alldt <- merge(revenue1, zillow.ny1, by.x = 'zipcode', by.y = 'RegionName')

Even after capping, the distribution still appears to be influenced by outliers.

Quality Check for All Data

Remove "fake listings"; 17 rows are removed.

fakelistings <- alldt[((alldt$last_review < as.Date('2015-05-01')) & 
                        (alldt$availability_30 == 0) & 
                        (alldt$availability_60 == 0) &
                        (alldt$availability_90 == 0) &
                        (alldt$availability_365 == 0)),]
# remove the listings
alldt <- alldt[((alldt$last_review >= as.Date('2015-05-01')) |
                  is.na(alldt$last_review)|
                        (alldt$availability_30 != 0) | 
                        (alldt$availability_60 != 0) |
                        (alldt$availability_90 != 0) |
                        (alldt$availability_365 != 0)),]

Metadata Created for Further Evaluation

Cost

Given the observations above and my assumption about timing, I used a centered moving average to represent the property price in April 2017, in order to reduce noise and uncover patterns in the data. In particular, I calculated a 5-point moving average; the code can be found right before the merge code. \(y_t = \frac{y_{t-2} + y_{t-1} + y_t + y_{t+1} + y_{t+2}}{5}\)

Occupancy Rate

Occupancy rate for a short-term vacation rental is the number of booked nights divided by the sum of available nights and booked nights: \(Occupancy Rate = \frac{Booked Nights}{Available Nights + Booked Nights}\)

  • Availability is not representative: the time since hosts last updated the calendar ranges from several days to several years.
  • Review scores vary with property location.
  • Use 0.75 as the occupancy rate for all properties. Keep in mind that this over-estimates occupancy for properties with lower demand, and the same holds for Return on Investment and Cap Rate.
# construct occupancy rate 
alldt$or <- 0.75
# get the year for first review
alldt$year <- substr(alldt$first_review,1,4)

Profitability Metrics

Here I used one year as the unit of time. I calculated the annual revenue based on the occupancy rate and the median price for each zip code.

I chose three metrics to evaluate the profitability of each zip code, as they focus on different aspects of the investment: first, Return on Investment (ROI), which measures the efficiency of the investment; second, Break-Even Period, which measures how long it takes to pay back the initial cost; third, the total number of reviews, which measures demand in the zip code. I included all three because "there is no single metric that will give all the information you need to make the best choices possible about real estate investment." In addition, since I assume occupancy to be the same across zip codes, which is not realistic, it is important to include a metric that reflects the actual demand in each zip code.

  • ROI

\(ROI_i = \frac{Current Investment Value - Cost of Investment}{Cost of Investment} = \frac{Revenue_i + Appreciation Value_i - Cost of Investment}{Cost of Investment}\)

in which \(Revenue_i\) represents the total revenue generated until \(year_i\) and \(Appreciation Value_i\) represents the total appreciation value of the property until \(year_i\). For now, I only consider \(Revenue_i\), given the time limitation. I set \(i\) to 1, 5, 10, 15, 20, 25, 30, 35 and 40 so that I could analyze the profitability of each zip code over different time frames, and I divided the year-40 ROI by 40 to get an annualized ROI rate for convenient comparison between zip codes.

  • Breakeven Period measures how long it takes for the investment in each zip code to pay back.

\(Annual Net Operating Income (NOI) = 365 × Occupancy Rate × Daily Revenue\)

\(Breakeven Period = \frac{Initial Cost}{Annual NOI}\)
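As a quick numeric check of the two formulas above (the nightly price and property cost below are illustrative, not taken from the dataset):

```r
# Annual NOI = 365 * occupancy rate * daily revenue;
# break-even period = initial cost / annual NOI.
occupancy   <- 0.75
daily_price <- 200      # illustrative median nightly price (USD)
cost        <- 1.3e6    # illustrative property purchase cost (USD)

annual_noi      <- 365 * occupancy * daily_price  # 54,750
breakeven_years <- cost / annual_noi              # about 23.7 years
```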

  • rev_peryear: Calculate revenue per year
  • beperiod: Calculate break even period
# create median price for each zip code
all_zp <- alldt %>%
        group_by(zipcode) %>%
        dplyr::summarize(neighbourhood = paste(unique(neighbourhood_group_cleansed)), 
                         med_rev = median(price_new), 
                         mean_rev = mean(price_new), cost = mean(ma_5), or = mean(or), 
                         tot_review = sum(number_of_reviews), 
                         ave_review = mean(number_of_reviews),n=n(),
                         long = round(mean(longitude),5),lat=round(mean(latitude),5)
                         )
all_zp <- as.data.frame(all_zp)
all_zp$rev_peryear <- 0
all_zp$beperiod <- 0
# revenue generated per year is time * occupancy rate * median price 
all_zp$rev_peryear <- 365*all_zp$or*all_zp$med_rev
# cap rate equals to annual noi / total cost
all_zp$cr <- all_zp$rev_peryear/all_zp$cost*100
# break-even period equals to 1/cap rate
all_zp$beperiod <- all_zp$cost/all_zp$rev_peryear
# profitability: ROI in year 1, year 5 to year 40 
all_zp <- all_zp %>% mutate(year_01 = -(cost - rev_peryear)/cost,
                            year_05 = -(cost - 5*rev_peryear)/cost,
                            year_10 = -(cost - 10*rev_peryear)/cost,
                            year_15 = -(cost - 15*rev_peryear)/cost,
                            year_20 = -(cost - 20*rev_peryear)/cost,
                            year_25 = -(cost - 25*rev_peryear)/cost,
                            year_30 = -(cost - 30*rev_peryear)/cost,
                            year_35 = -(cost - 35*rev_peryear)/cost,
                            year_40 = -(cost - 40*rev_peryear)/cost,
                            annual_roi = year_40/40)
# create data table for rShiny
top_n <- data.frame(matrix(ncol = 0, nrow = nrow(all_zp)))
top_n$return_on_investment <- all_zp[order(-all_zp$year_01),]$zipcode
top_n$break_even_period <- all_zp[order(all_zp$beperiod),]$zipcode
top_n$total_reviews <- all_zp[order(-all_zp$tot_review),]$zipcode

Exploratory Data Analysis

Revenue Analysis

Quantity Analysis

Renting Price Analysis

(Density map of listings over New York drawn with ggmap; 37 rows containing non-finite values were removed from the plot.)

I did not use availability to forecast the occupancy rate because hosts updated their properties' availability at different times.

Visualization of Profitability Metrics

Combine each metric together and plot them as below.

I listed the top 10 zip codes for each metric; zip codes that appear under every metric are recommended.
From the aggregated table, 10036, 10025, and 11231 meet the requirements: a relatively high annual ROI and a relatively short break-even period.
Top 10 Zip Codes for Individual Metric

return_on_investment   break_even_period   total_reviews
10312                  10312               10003
10304                  10304               10036
11434                  11434               11215
10305                  10305               11217
10306                  10306               10025
11234                  11234               10014
10036                  10036               10013
10025                  10025               10011
10308                  10308               11201
11231                  11231               11231
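The selection rule above — recommending zip codes that appear in the top 10 of every metric — can be sketched as follows (reproducing the table above as a toy data frame):

```r
# Each column is a ranked list of zip codes, best first (from the table above).
top_n <- data.frame(
  return_on_investment = c("10312","10304","11434","10305","10306",
                           "11234","10036","10025","10308","11231"),
  break_even_period    = c("10312","10304","11434","10305","10306",
                           "11234","10036","10025","10308","11231"),
  total_reviews        = c("10003","10036","11215","11217","10025",
                           "10014","10013","10011","11201","11231"),
  stringsAsFactors = FALSE
)
# Zip codes present in all three top-10 lists are recommended.
recommended <- Reduce(intersect, top_n)
# recommended: "10036" "10025" "11231"
```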

Tidying up the profit summary data for visualization. The plot below shows how ROI changes over time for each zip code. By year 40, ROI for all zip codes is above zero, and 10312 has the highest ROI, which aligns with the figure above.

Next Steps